Overview

Dataset Statistics

Number of Variables 21
Number of Rows 10866
Missing Cells 13434
Missing Cells (%) 5.9%
Duplicate Rows 1
Duplicate Rows (%) 0.0%
Total Size in Memory 13.8 MB
Average Row Size in Memory 1.3 KB
Variable Types
  • Numerical: 10
  • Categorical: 11
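
The missing-cell percentage above can be checked from the reported counts. The sketch below is only that arithmetic; the variable names are illustrative and are not part of the report.

```python
# Counts copied from the Dataset Statistics block above.
n_variables = 21
n_rows = 10866
missing_cells = 13434

# Every row contributes one cell per variable.
total_cells = n_variables * n_rows
missing_pct = 100 * missing_cells / total_cells

print(f"{total_cells} cells, {missing_pct:.1f}% missing")
```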

Dataset Insights

Similar Distribution
  • budget and budget_adj have similar distributions
  • revenue and revenue_adj have similar distributions
Missing
  • homepage has 7930 (72.98%) missing values
  • tagline has 2824 (25.99%) missing values
  • keywords has 1493 (13.74%) missing values
  • production_companies has 1030 (9.48%) missing values
Skewed
  • id is skewed
  • popularity is skewed
  • budget is skewed
  • revenue is skewed
  • runtime is skewed
  • vote_count is skewed
  • release_year is skewed
  • budget_adj is skewed
  • revenue_adj is skewed
High Cardinality
  • imdb_id has a high cardinality: 10855 distinct values
  • original_title has a high cardinality: 10571 distinct values
  • cast has a high cardinality: 10719 distinct values
  • homepage has a high cardinality: 2896 distinct values
  • director has a high cardinality: 5067 distinct values
  • tagline has a high cardinality: 7997 distinct values
  • keywords has a high cardinality: 8804 distinct values
  • overview has a high cardinality: 10847 distinct values
  • genres has a high cardinality: 2039 distinct values
  • production_companies has a high cardinality: 7445 distinct values
  • release_date has a high cardinality: 5909 distinct values
Constant Length
  • imdb_id has constant length 9
Zeros
  • budget has 5696 (52.42%) zeros
  • revenue has 6016 (55.37%) zeros
  • budget_adj has 5696 (52.42%) zeros
  • revenue_adj has 6016 (55.37%) zeros

Variables


id

numerical

Approximate Distinct Count 10865
Approximate Unique (%) 100.0%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 66064.1774
Minimum 5
Maximum 417859
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • id is skewed right (γ1 = 1.7321)

Quantile Statistics

Minimum 5
5-th Percentile 1221
Q1 10596.25
Median 20669
Q3 75610
95-th Percentile 288556
Maximum 417859
Range 417854
IQR 65013.75
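
The IQR above is Q3 minus Q1. A minimal sketch of that arithmetic follows, together with the conventional 1.5 × IQR (Tukey) fences often used to flag outliers. The report does not state which rule produced its outlier count, so the fences are an assumption, not the report's method.

```python
# Quartiles copied from the Quantile Statistics block for id.
q1 = 10596.25
q3 = 75610

iqr = q3 - q1

# Assumed rule: values beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(iqr, lower_fence, upper_fence)
```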

Descriptive Statistics

Mean 66064.1774
Standard Deviation 92130.1366
Variance 8.488e+09
Sum 7.1785e+08
Skewness 1.7321
Kurtosis 1.7805
Coefficient of Variation 1.3946
  • id is not normally distributed (p-value 1.6758738044415097e-18)
  • id has 1606 outliers

imdb_id

categorical

Approximate Distinct Count 10855
Approximate Unique (%) 100.0%
Missing 10
Missing (%) 0.1%
Memory Size 803344
  • The most frequent value (tt0411951) occurs over 2.0 times as often as the second most frequent value (tt0035423)

Length

Mean 9
Standard Deviation 0
Median 9
Minimum 9
Maximum 9

Sample

1st row tt0369610
2nd row tt1392190
3rd row tt2908446
4th row tt2488496
5th row tt2820852

Letter

Count 21712
Lowercase Letter 21712
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 75992
  • The most frequent word (tt0411951) occurs over 2.0 times as often as the second most frequent word (tt0035423)
  • imdb_id has words of constant length
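
The constant length of 9 and the 2:7 ratio of lowercase letters to digits reported above are consistent with an identifier of the form "tt" followed by seven digits. The sketch below checks the five sample rows against that pattern; the regular expression is an assumption inferred from the samples, not something the report states.

```python
import re

# The five sample rows listed for imdb_id.
samples = ["tt0369610", "tt1392190", "tt2908446", "tt2488496", "tt2820852"]

# Assumed pattern: literal "tt" followed by exactly seven digits.
IMDB_ID = re.compile(r"tt\d{7}")

all_match = all(IMDB_ID.fullmatch(s) for s in samples)
lengths = {len(s) for s in samples}

print(all_match, lengths)
```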

popularity

numerical

Approximate Distinct Count 10814
Approximate Unique (%) 99.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 0.6464
Minimum 6.5e-05
Maximum 32.9858
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • popularity is skewed right (γ1 = 9.875)

Quantile Statistics

Minimum 6.5e-05
5-th Percentile 0.06425
Q1 0.2076
Median 0.3839
Q3 0.7138
95-th Percentile 2.0466
Maximum 32.9858
Range 32.9857
IQR 0.5062

Descriptive Statistics

Mean 0.6464
Standard Deviation 1.0002
Variance 1.0004
Sum 7024.2274
Skewness 9.875
Kurtosis 210.9005
Coefficient of Variation 1.5472
  • popularity is not normally distributed (p-value 8.067449186875745e-24)
  • popularity has 946 outliers

budget

numerical

Approximate Distinct Count 557
Approximate Unique (%) 5.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 1.4626e+07
Minimum 0
Maximum 425000000
Zeros 5696
Zeros (%) 52.4%
Negatives 0
Negatives (%) 0.0%
  • budget is skewed right (γ1 = 3.7167)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 1.5e+07
95-th Percentile 7.5e+07
Maximum 425000000
Range 425000000
IQR 1.5e+07

Descriptive Statistics

Mean 1.4626e+07
Standard Deviation 3.0913e+07
Variance 9.5563e+14
Sum 1.5892e+11
Skewness 3.7167
Kurtosis 19.26
Coefficient of Variation 2.1136
  • budget is not normally distributed (p-value 1.8454215916519603e-24)
  • budget has 1370 outliers
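
Because more than half of the budget values are zero, the median and both lower quantiles collapse to 0. The sketch below shows the effect on a small hypothetical list (not rows from the dataset) and recomputes the zero share from the reported counts. Whether zeros stand for unknown budgets is an assumption the report does not confirm.

```python
from statistics import median

# Hypothetical budgets shaped like the column: mostly zeros.
budgets = [0, 0, 0, 15_000_000, 75_000_000]

median_all = median(budgets)              # zeros dominate the middle
nonzero = [b for b in budgets if b != 0]
median_nonzero = median(nonzero)          # zeros excluded

# Zero share recomputed from the counts reported above.
zero_share = 100 * 5696 / 10866

print(median_all, median_nonzero, round(zero_share, 2))
```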

revenue

numerical

Approximate Distinct Count 4702
Approximate Unique (%) 43.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 3.9823e+07
Minimum 0
Maximum 2781505847
Zeros 6016
Zeros (%) 55.4%
Negatives 0
Negatives (%) 0.0%
  • revenue is skewed right (γ1 = 6.6575)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 2.4e+07
95-th Percentile 2.1367e+08
Maximum 2781505847
Range 2781505847
IQR 2.4e+07

Descriptive Statistics

Mean 3.9823e+07
Standard Deviation 1.17e+08
Variance 1.369e+16
Sum 4.3272e+11
Skewness 6.6575
Kurtosis 73.1343
Coefficient of Variation 2.9381
  • revenue is not normally distributed (p-value 6.5702137455354245e-25)
  • revenue has 1736 outliers

original_title

categorical

Approximate Distinct Count 10571
Approximate Unique (%) 97.3%
Missing 0
Missing (%) 0.0%
Memory Size 892011

Length

Mean 16.0025
Standard Deviation 9.117
Median 14
Minimum 1
Maximum 104

Sample

1st row Jurassic World
2nd row Mad Max: Fury Road
3rd row Insurgent
4th row Star Wars: The For...
5th row Furious 7

Letter

Count 148695
Lowercase Letter 121760
Space Separator 20179
Uppercase Letter 26935
Dash Punctuation 206
Decimal Number 1142
  • The most frequent word (the) occurs over 9.81 times as often as the second most frequent word (a)

cast

categorical

Approximate Distinct Count 10719
Approximate Unique (%) 99.3%
Missing 76
Missing (%) 0.7%
Memory Size 1547952

Length

Mean 67.8726
Standard Deviation 10.523
Median 69
Minimum 7
Maximum 110

Sample

1st row Chris Pratt|Bryce ...
2nd row Tom Hardy|Charlize...
3rd row Shailene Woodley|T...
4th row Harrison Ford|Mark...
5th row Vin Diesel|Paul Wa...

Letter

Count 628932
Lowercase Letter 517850
Space Separator 55922
Uppercase Letter 111082
Dash Punctuation 804
Decimal Number 25

homepage

categorical

Approximate Distinct Count 2896
Approximate Unique (%) 98.6%
Missing 7930
Missing (%) 73.0%
Memory Size 299997

Length

Mean 37.1461
Standard Deviation 13.0008
Median 34
Minimum 13
Maximum 242

Sample

1st row http://www.jurassi...
2nd row http://www.madmaxm...
3rd row http://www.thedive...
4th row http://www.starwar...
5th row http://www.furious...

Letter

Count 87490
Lowercase Letter 86843
Space Separator 0
Uppercase Letter 647
Dash Punctuation 1265
Decimal Number 1247

director

categorical

Approximate Distinct Count 5067
Approximate Unique (%) 46.8%
Missing 44
Missing (%) 0.4%
Memory Size 878014

Length

Mean 14.5581
Standard Deviation 10.1672
Median 13
Minimum 2
Maximum 533

Sample

1st row Colin Trevorrow
2nd row George Miller
3rd row Robert Schwentke
4th row J.J. Abrams
5th row James Wan

Letter

Count 141844
Lowercase Letter 116535
Space Separator 12934
Uppercase Letter 25309
Dash Punctuation 177
Decimal Number 1

tagline

categorical

Approximate Distinct Count 7997
Approximate Unique (%) 99.4%
Missing 2824
Missing (%) 26.0%
Memory Size 887683
  • The most frequent value (Based on a true story.) occurs over 1.67 times as often as the second most frequent value (Be careful what you wish for.)

Length

Mean 44.1745
Standard Deviation 25.5953
Median 38
Minimum 1
Maximum 286

Sample

1st row The park is open.
2nd row What a Lovely Day.
3rd row One Choice Can Des...
4th row Every generation h...
5th row Vengeance Hits Hom...

Letter

Count 279628
Lowercase Letter 258497
Space Separator 57392
Uppercase Letter 21131
Dash Punctuation 364
Decimal Number 950
  • The most frequent word (the) occurs over 1.81 times as often as the second most frequent word (a)

keywords

categorical

Approximate Distinct Count 8804
Approximate Unique (%) 93.9%
Missing 1493
Missing (%) 13.7%
Memory Size 1008908
  • The most frequent value (woman director) occurs over 1.63 times as often as the second most frequent value (independent film)

Length

Mean 41.9725
Standard Deviation 17.876
Median 43
Minimum 2
Maximum 131

Sample

1st row monster|dna|tyrann...
2nd row future|chase|post-...
3rd row based on novel|rev...
4th row android|spaceship|...
5th row car race|speed|rev...

Letter

Count 346215
Lowercase Letter 346215
Space Separator 17513
Uppercase Letter 0
Dash Punctuation 570
Decimal Number 366
  • The most frequent word (director) occurs over 1.54 times as often as the second most frequent word (film)

overview

categorical

Approximate Distinct Count 10847
Approximate Unique (%) 99.9%
Missing 4
Missing (%) 0.0%
Memory Size 4898037
  • The most frequent value (No overview found.) occurs over 6.5 times as often as the second most frequent value (1960. The thrilling battles waged by a band of kids from two rival villages in the southern French countryside.)

Length

Mean 307.0376
Standard Deviation 171.9178
Median 282
Minimum 13
Maximum 1000

Sample

1st row Twenty-two years a...
2nd row An apocalyptic sto...
3rd row Beatrice Prior mus...
4th row Thirty years after...
5th row Deckard Shaw seeks...

Letter

Count 2679264
Lowercase Letter 2589207
Space Separator 556977
Uppercase Letter 90057
Dash Punctuation 9124
Decimal Number 8887

runtime

numerical

Approximate Distinct Count 247
Approximate Unique (%) 2.3%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 102.0709
Minimum 0
Maximum 900
Zeros 31
Zeros (%) 0.3%
Negatives 0
Negatives (%) 0.0%
  • runtime is skewed right (γ1 = 6.103)

Quantile Statistics

Minimum 0
5-th Percentile 75
Q1 90
Median 99
Q3 111
95-th Percentile 139
Maximum 900
Range 900
IQR 21

Descriptive Statistics

Mean 102.0709
Standard Deviation 31.3814
Variance 984.7926
Sum 1.1091e+06
Skewness 6.103
Kurtosis 116.1835
Coefficient of Variation 0.3074
  • runtime is not normally distributed (p-value 1.971475253967209e-19)
  • runtime has 781 outliers
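
Several of the descriptive statistics above are related: variance is the square of the standard deviation, the coefficient of variation is the standard deviation divided by the mean, and the sum is the mean times the row count. A minimal sketch of those relations using the reported runtime figures follows; small differences from the report come from rounding in the printed inputs.

```python
# Figures copied from the runtime block and the dataset overview.
n_rows = 10866
mean = 102.0709
std = 31.3814

variance = std ** 2      # reported as 984.7926
cv = std / mean          # reported as 0.3074
total = mean * n_rows    # reported as 1.1091e+06

print(f"variance={variance:.2f}, cv={cv:.4f}, sum={total:.4e}")
```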

genres

categorical

Approximate Distinct Count 2039
Approximate Unique (%) 18.8%
Missing 23
Missing (%) 0.2%
Memory Size 905793

Length

Mean 18.5371
Standard Deviation 9.8553
Median 17
Minimum 3
Maximum 51

Sample

1st row Action|Adventure|S...
2nd row Action|Adventure|S...
3rd row Adventure|Science ...
4th row Action|Adventure|S...
5th row Action|Crime|Thril...
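
The samples show that each genres entry packs several values into one pipe-delimited string, which helps explain the high distinct count. A minimal sketch of splitting such strings and counting individual genres follows. The rows are hypothetical, since the report truncates the real ones, and the same approach would apply to cast, keywords and production_companies.

```python
from collections import Counter

# Hypothetical rows in the pipe-delimited style of the sample;
# illustrative values, not rows copied from the dataset.
rows = ["Action|Adventure", "Action|Crime", "Drama", None]

genre_counts = Counter(
    genre
    for row in rows
    if row is not None            # skip missing entries
    for genre in row.split("|")   # one string holds several genres
)

print(genre_counts.most_common())
```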

Letter

Count 183484
Lowercase Letter 154960
Space Separator 1397
Uppercase Letter 28524
Dash Punctuation 0
Decimal Number 0

production_companies

categorical

Approximate Distinct Count 7445
Approximate Unique (%) 75.7%
Missing 1030
Missing (%) 9.5%
Memory Size 1120457

Length

Mean 45.5162
Standard Deviation 28.8952
Median 39
Minimum 3
Maximum 184

Sample

1st row Universal Studios|...
2nd row Village Roadshow P...
3rd row Summit Entertainme...
4th row Lucasfilm|Truenort...
5th row Universal Pictures...

Letter

Count 392845
Lowercase Letter 329899
Space Separator 34207
Uppercase Letter 62946
Dash Punctuation 848
Decimal Number 1645

release_date

categorical

Approximate Distinct Count 5909
Approximate Unique (%) 54.4%
Missing 0
Missing (%) 0.0%
Memory Size 803650

Length

Mean 8.9601
Standard Deviation 0.6429
Median 9
Minimum 8
Maximum 10

Sample

1st row 6/9/2015
2nd row 5/13/2015
3rd row 3/18/2015
4th row 12/15/2015
5th row 4/1/2015
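
release_date is profiled as categorical because it is stored as text. The sample rows appear to follow a month/day/year layout without zero padding (5/13/2015 cannot be day/month). A minimal sketch of parsing the five sample rows under that assumption:

```python
from datetime import datetime

# The five sample rows listed for release_date.
samples = ["6/9/2015", "5/13/2015", "3/18/2015", "12/15/2015", "4/1/2015"]

# Assumed layout: month/day/four-digit year.
parsed = [datetime.strptime(s, "%m/%d/%Y").date() for s in samples]

for raw, value in zip(samples, parsed):
    print(raw, "->", value.isoformat())
```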

Letter

Count 0
Lowercase Letter 0
Space Separator 0
Uppercase Letter 0
Dash Punctuation 0
Decimal Number 75628

vote_count

numerical

Approximate Distinct Count 1289
Approximate Unique (%) 11.9%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 217.3897
Minimum 10
Maximum 9767
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • vote_count is skewed right (γ1 = 6.1765)

Quantile Statistics

Minimum 10
5-th Percentile 11
Q1 17
Median 38
Q3 145.75
95-th Percentile 1025.75
Maximum 9767
Range 9757
IQR 128.75

Descriptive Statistics

Mean 217.3897
Standard Deviation 575.6191
Variance 331337.2996
Sum 2.3622e+06
Skewness 6.1765
Kurtosis 53.3359
Coefficient of Variation 2.6479
  • vote_count is not normally distributed (p-value 7.728296449859751e-25)
  • vote_count has 1518 outliers

vote_average

numerical

Approximate Distinct Count 72
Approximate Unique (%) 0.7%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 5.9749
Minimum 1.5
Maximum 9.2
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • vote_average is skewed left (γ1 = -0.4358)

Quantile Statistics

Minimum 1.5
5-th Percentile 4.4
Q1 5.4
Median 6
Q3 6.6
95-th Percentile 7.3
Maximum 9.2
Range 7.7
IQR 1.2

Descriptive Statistics

Mean 5.9749
Standard Deviation 0.9351
Variance 0.8745
Sum 64923.5
Skewness -0.4358
Kurtosis 0.5427
Coefficient of Variation 0.1565
  • vote_average is not normally distributed (p-value 0.0009273853229668368)
  • vote_average has 197 outliers

release_year

numerical

Approximate Distinct Count 56
Approximate Unique (%) 0.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 2001.3227
Minimum 1960
Maximum 2015
Zeros 0
Zeros (%) 0.0%
Negatives 0
Negatives (%) 0.0%
  • release_year is skewed left (γ1 = -1.2041)

Quantile Statistics

Minimum 1960
5-th Percentile 1973
Q1 1995
Median 2006
Q3 2011
95-th Percentile 2015
Maximum 2015
Range 55
IQR 16

Descriptive Statistics

Mean 2001.3227
Standard Deviation 12.8129
Variance 164.1714
Sum 2.1746e+07
Skewness -1.2041
Kurtosis 0.7991
Coefficient of Variation 0.006402
  • release_year is not normally distributed (p-value 3.627365176309898e-10)
  • release_year has 403 outliers

budget_adj

numerical

Approximate Distinct Count 2614
Approximate Unique (%) 24.1%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 1.7551e+07
Minimum 0
Maximum 4.25e+08
Zeros 5696
Zeros (%) 52.4%
Negatives 0
Negatives (%) 0.0%
  • budget_adj is skewed right (γ1 = 3.1145)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 2.0853e+07
95-th Percentile 8.9375e+07
Maximum 4.25e+08
Range 4.25e+08
IQR 2.0853e+07

Descriptive Statistics

Mean 1.7551e+07
Standard Deviation 3.4306e+07
Variance 1.1769e+15
Sum 1.9071e+11
Skewness 3.1145
Kurtosis 13.0304
Coefficient of Variation 1.9547
  • budget_adj is not normally distributed (p-value 1.927960547284866e-24)
  • budget_adj has 1231 outliers

revenue_adj

numerical

Approximate Distinct Count 4840
Approximate Unique (%) 44.5%
Missing 0
Missing (%) 0.0%
Infinite 0
Infinite (%) 0.0%
Memory Size 173856
Mean 5.1364e+07
Minimum 0
Maximum 2.8271e+09
Zeros 6016
Zeros (%) 55.4%
Negatives 0
Negatives (%) 0.0%
  • revenue_adj is skewed right (γ1 = 6.2503)

Quantile Statistics

Minimum 0
5-th Percentile 0
Q1 0
Median 0
Q3 3.3697e+07
95-th Percentile 2.7655e+08
Maximum 2.8271e+09
Range 2.8271e+09
IQR 3.3697e+07

Descriptive Statistics

Mean 5.1364e+07
Standard Deviation 1.4463e+08
Variance 2.0919e+16
Sum 5.5813e+11
Skewness 6.2503
Kurtosis 63.3502
Coefficient of Variation 2.8158
  • revenue_adj is not normally distributed (p-value 7.859509192169205e-25)
  • revenue_adj has 1689 outliers

Interactions

Correlations

Missing Values